Comparison of statistical methods for the detection of differentially methylated regions.
Comparison of statistical methods for the detection of differentially methylated regions.
Introduction
DNA methylation
DNA methylation (DNAme) is a process by which methyl groups are added to the DNA molecule. Methylation could change the activity of the DNA without changing its code. It serves as a critical phenomenon to several biological functions and has a role in certain disease states. Further, DNAme is a privileged candidate for epigenetic inheritance due to its plasticity and stability across mitosis and meiosis. It is one of the major mechanisms of epigenetic modification and has a fundamental influence on gene expression, genomic stability and cellular activity (Leek et al. 2010; Lavie et al. 2004).
With the advancement of next-generation (NGS) sequencing techniques, there are several methods for studying DNA methylation, but, few offer a better resolution of methylation status such as bisulfite sequencing (also known as Bisulfite-Seq, BS-Seq, Methyl-Seq or WGBS). The key idea of the method is combining the power of high-throughput DNA sequencing with the treatment of DNA with sodium bisulfite.
Whole Genome Bisulphite Sequencing
Whole Genome Bisulphite Sequencing (WGBS) is a gold standard NGS technique for studying DNAme in the genome of mammals. WGBS combines sodium bisulfite conversion of the DNA sequence with high throughput DNA sequencing. The sodium bisulfite reaction converts the unmethylated cytosines (C) to uracils (U), whereas, methylated C remains unchanged. These U are further converted to thymine (T) after the polymerase chain reaction (PCR). Although, no change occurs for methylated C. Methylcytosines are the methylated versions of the cytosine bases. This is done by transferring a methyl group onto the C5 position of the cytosine.
The aim of NGS-based DNA methylation analysis is to investigate genomic DNA and find out whether single cytosines or entire regions in the genome are methylated or not. WGBS is the most comprehensive sequencing for DNA methylation profiling, allowing single-base resolution of 5-mC within the whole genome. By comparing treated and untreated sequences, the location of the methylated cytosines is determined.
There are several applications of WGBS, including (not limited to):
detection of differentially methylated regions (DMRs)
detection of differentially methylated locis (DMLs)
detection of copy number variations (CNVs)
detection of single nucleotide polymorphisms (SNPs)
detection of cytosine methylation levels of transcription factor binding sites (TFBSs)
detecting methylation level or distribution (globally, single nucleotide, chromosome-wide and gene-centric)
Sequencing the whole genome is generally quite expensive. Therefore, although it has been applied to large genomes such as the human genome, large numbers of individual samples are seldom sequenced. Reduced representation bisulfite sequencing (RRBS) has been developed for this reason, in which the bisulfite reaction occurs but the sequencing is limited to around 1% of the genome. This enables the sequencing of the genome of several individuals.
Basic pipeline for WGBS data analysis
library(DiagrammeR)
grViz("
digraph flowchart {
graph [overlap = true, layout=dot,fontsize=12]
node [shape = box, style=filled, fillcolor=NavajoWhite, color=DarkslateGray, fontsize=12];
A; B; E; F; G;
node [shape = egg, style=filled, fillcolor=NavajoWhite, color=DarkslateGray, fontsize=12];
C; D;
A [label='Quality Check: FastQC']
B [label='Quality Control: trim_galore']
C [label='Alignment: Bismark']
D [label='Methylation extraction: Bismark']
E [label='Overview of QC: MultiQC']
F [label='Differential analysis: DMLs']
G [label='Differential analysis: DMRs']
edge[color=DarkslateGray,
penwidth=5
]
A->B
B->C[label = ' trim adapters, remove shorter reads, trim Ns']
C->D
C->E
A->E
B->E
D->F[label = ' DMLs: single basepair resolution']
D->G[label = ' DMRs: identify regions']
}
")Figure 1. Basic pipeline for WGBS data analysis.
Limitations of current methods
Methods are greatly hindered by low sample size. Most of the methods depend on large sample sizes.
The number of tests performed = the number of loci analyzed. Example: Human genome has ~30 million CpGs (Smith and Meissner 2013).
Measurements are spatially correlated across the genome (Leek et al. 2010), but in most methods measurements from all loci are treated independently.
Multiple testing corrections without considering the spatial correlation can result in a loss of power.
DMRs more biologically relevant than DMLs.
DML approaches may construct DMRs by chaining together neighboring significant loci, but this type of approach will not yield a proper assessment of the statistical significance of the constructed regions, nor will the FDR be properly controlled (Robinson et al. 2014).
Controlling the FDR at the level of individual loci is not the same as controlling FDR of regions.
Methylation data cannot be modeled either by Gaussian models (due to low coverage) or by Binomial models (do not account for biological variability). Beta-Binomial models are computationally difficult.
Challenges for assessing DMRs
defining region boundaries
methods ignore correlation across loci
biological variability from sample to sample
should be powerful even in the case of 2 samples per group.
Methods
Data
Two data sets were used in this project.
First, representing negative control (with no expected DMRs), contained six methylation profiling samples from normal human dendritic cells (data in GEO). The samples were artificially divided into two groups of three samples, Group1 consisting of samples GSM1565940, GSM1565944, and GSM1565948, Group2 consisting of samples GSM1565942, GSM1565946, and GSM1565950. To obtain reasonable running times, only methylation data from chromosome 18 were used, similar to the dmrseq paper (Korthauer et al. 2018).
The second set of data was created from the negative control by adding 100 simulated DMRs using simDMRS function from the R package dmrseq.
The reported DMRs were (if possible) filtered for only those that contain at least 10 CpGs and the mean difference between the two groups is at least 0.1 (the simulated regions had effect sizes between 0.163 and 0.450).
The methods used for comparison
In total 7 different methods were used for the comparison. Their main features, as well as links to the analysis files are summarized in the Table 1:
| Methods | Preprocessing + statistical test used | p-value estimate given | CpG / non CpG | Implementation | Reference | Availability | Analysis file |
|---|---|---|---|---|---|---|---|
| bsseq | local-likelihood smoothing, t-test similar statistics | no | CpG only | R | (Hansen, Langmead, and Irizarry 2012) | Bioconductor package bsseq | Report |
| CGmapTools | unpaired t-test | yes | CpG only | C, C++, Python | (Guo et al. 2017) | binary package | Report |
| defiant | Weighted Welch Expansion | yes | CpG only | Shell | (Condon et al. 2018) | binary package | Report |
| DMRcaller | no preprocessing, or pooling in bins, or kernel smoothing; Fisher’s exact or Score test | yes | all contexts | R | (Catoni et al. 2018) | Bioconductor package DMRcaller | Report |
| dmrseq | smoothing; generalized least squares | yes | CpG only | R | (Korthauer et al. 2018) | Bioconductor package dmrseq | Report |
| methPipe | Fisher’s exact test | no | CpG only | C, C++ | (Song et al. 2013) | binary package | Report |
| methylKit | Fisher’s exact test, or logistic regression with tiling | yes | all contexts | R | (Akalin et al. 2012) | Bioconductor package methylKit | Report |
Results and discussion
Diagnostic plots
Distribution of the coverage and beta values of all the samples, shown in Figure 2A and Figure 2B and Figure 3A and Figure 3B, respectively, depicts that there is not much difference between individual samples. Also, the PCA plot, Figure 4A and Figure 4B, shows that there is no clustering of the two artificially created groups.
Coverage
Negative Control
Simulated data
Beta values
Negative Control
Simulated data
PCA plots
Negative Control
DMRs identified by the methods
There was no single region that would all methods overlap by at least 1bp.
Figure 5 shows one simulated DMR, generated as described in the Data section. This region has the highest delta (methylation difference) value in the list of all generated DMRs (0.447). This region was identified as DMR by 6 methods (not identified by methpipe; also not identified by DMRcaller bins method).
Similar plots for all 100 simulated DMRS are shown in this report.
Comparison of different methods
We considered any overlap between the simulated and identified DMRs to be sufficient for the region to be classified as true positive.Details of computing true positives and false positives are described in this report.
Negative Control
As per negative control, we do not expect any true DMRs, all the identified regions are thus false positives. dmrseq and defiant methods did not find any DMRs in the negative control data, whereas others did, as shown in Figure 6 and Table 2.
Barplot
Table
Simulated Data
FDR vs Power
Figure 7 shows that dmrseq performs better than the other methods. On x axis is the observed FDR, on y axis the power, both for different specified FDR cut-offs. The computation of FDR and power is described here.
FDR: Specified vs Observed
We have also checked how well the methods control FDR, Figure 8. For this, we plotted the observed FDR (the proportion of false discoveries) and the specified FDR (cutoff after multiple testing correction by BH). We can see that dmrseq again outperforms the other methods.
BSseq and methpipe do not allow user control of FDR, and are thus not depicted in the Figureure.
More comparison plots as well as the code can be found in this report.
Conclusion
In this project, we wanted to compare the methods for identification of DMRs, which were not compared in the dmrseq paper (Korthauer et al. 2018). We used nearly all methods available, except the ones which do not have a proper statistical description or have not been updated for a long time. We found that dmrseq outperforms all the methods that we compared. This is especially important when we take into acccount that we used just 3 samples per group, which is the most one can hope for in the real experiments (due to the high cost of WGBS). But, it has to be mentioned that the data we used were the same as in the original dmrseq paper (Korthauer et al. 2018) and that the simulated DMRs were created using a function which is a part of the dmrseq method. It would be nice and useful to repeat the experiment with some independently generated data.
All the methods were well documented and were easy to use for a non-experienced user, except for CGmapTools, where the documentation was not clear.
References
Akalin, Altuna, Matthias Kormaksson, Sheng Li, Francine E Garrett-Bakelman, Maria E Figueroa, Ari Melnick, and Christopher E Mason. 2012. “methylKit: A Comprehensive R Package for the Analysis of Genome-Wide DNA Methylation Profiles.” Genome Biology 13 (10). Springer Nature: R87. https://doi.org/10.1186/gb-2012-13-10-r87.
Catoni, Marco, Jonathan MF Tsang, Alessandro P Greco, and Nicolae Radu Zabet. 2018. “DMRcaller: A Versatile R/Bioconductor Package for Detection and Visualization of Differentially Methylated Regions in CpG and Non-CpG Contexts.” Nucleic Acids Research, July. Oxford University Press (OUP). https://doi.org/10.1093/nar/gky602.
Condon, David E., Phu V. Tran, Yu-Chin Lien, Jonathan Schug, Michael K. Georgieff, Rebecca A. Simmons, and Kyoung-Jae Won. 2018. “Defiant: (DMRs: Easy, Fast, Identification and ANnoTation) Identifies Differentially Methylated Regions from Iron-Deficient Rat Hippocampus.” BMC Bioinformatics 19 (1). Springer Nature. https://doi.org/10.1186/s12859-018-2037-1.
Guo, Weilong, Ping Zhu, Matteo Pellegrini, Michael Q Zhang, Xiangfeng Wang, and Zhongfu Ni. 2017. “CGmapTools Improves the Precision of Heterozygous SNV Calls and Supports Allele-Specific Methylation Detection and Visualization in Bisulfite-Sequencing Data.” Edited by Inanc Birol. Bioinformatics 34 (3). Oxford University Press (OUP): 381–87. https://doi.org/10.1093/bioinformatics/btx595.
Hansen, Kasper D, Benjamin Langmead, and Rafael A Irizarry. 2012. “BSmooth: From Whole Genome Bisulfite Sequencing Reads to Differentially Methylated Regions.” Genome Biology 13 (10). Springer Nature: R83. https://doi.org/10.1186/gb-2012-13-10-r83.
Korthauer, Keegan, Sutirtha Chakraborty, Yuval Benjamini, and Rafael A Irizarry. 2018. “Detection and Accurate False Discovery Rate Control of Differentially Methylated Regions from Whole Genome Bisulfite Sequencing.” Biostatistics, February. Oxford University Press (OUP). https://doi.org/10.1093/biostatistics/kxy007.
Lavie, L., M. Kitova, E. Maldener, E. Meese, and J. Mayer. 2004. “CpG Methylation Directly Regulates Transcriptional Activity of the Human Endogenous Retrovirus Family HERV-K(HML-2).” Journal of Virology 79 (2). American Society for Microbiology: 876–83. https://doi.org/10.1128/jvi.79.2.876-883.2005.
Leek, Jeffrey T., Robert B. Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W. Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A. Irizarry. 2010. “Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data.” Nature Reviews Genetics 11 (10). Springer Nature: 733–39. https://doi.org/10.1038/nrg2825.
Robinson, Mark D., Abdullah Kahraman, Charity W. Law, Helen Lindsay, Malgorzata Nowicka, Lukas M. Weber, and Xiaobei Zhou. 2014. “Statistical Methods for Detecting Differentially Methylated Loci and Regions.” Frontiers in Genetics 5 (September). Frontiers Media SA. https://doi.org/10.3389/fgene.2014.00324.
Smith, Zachary D., and Alexander Meissner. 2013. “DNA Methylation: Roles in Mammalian Development.” Nature Reviews Genetics 14 (3). Springer Nature: 204–20. https://doi.org/10.1038/nrg3354.
Song, Qiang, Benjamin Decato, Elizabeth E. Hong, Meng Zhou, Fang Fang, Jianghan Qu, Tyler Garvin, Michael Kessler, Jun Zhou, and Andrew D. Smith. 2013. “A Reference Methylome Database and Analysis Pipeline to Facilitate Integrative and Comparative Epigenomics.” Edited by Osman El-Maarri. PLoS ONE 8 (12). Public Library of Science (PLoS): e81148. https://doi.org/10.1371/journal.pone.0081148.